# **Handling Memories Demo Script**

#### Introduction

This demonstration script provides high-level instructions on how to remove bottlenecks caused by arrays in a C design.

## Preparation:

- Required files: Necessary files are located at C:\training\hls\demos\memory_optimization
- Required hardware: None
- Supporting materials: None

## **Handling Memories**

|   | Action with Description                                                                                                                          | Point of Emphasis and Key Takeaway                                                    |
|---|--------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------|
| • | Launch the Vivado® HLS tool. Open the provided <b>dct_prj</b> Vivado HLS tool project located at C:\training\hls\demos\memory_optimization | You can open existing Vivado HLS tool projects from the Vivado HLS tool Welcome page. |

|   | Action with Description                                                      | Point of Emphasis and Key Takeaway                                                                                                                                                                                                                                                                        |
|---|------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| • | Access and review the source files (dct.c and dct.h) from the Explorer pane. | This C design implements a discrete cosine transform (DCT). The top-level function implements a 2D DCT algorithm by first processing each row of the input array via a 1D DCT, then processing the columns of the resulting array through the same 1D DCT. It calls the <i>read_data</i>, <i>dct_2d</i>, and <i>write_data</i> functions. |
|   |                                                                              | The <i>read_data</i> function is defined at line 54 and consists of two loops: <b>RD_Loop_Row</b> and <b>RD_Loop_Col</b>. |
|   |                                                                              | The <i>write_data</i> function is defined at line 66 and consists of two loops that write the result. The <i>dct_2d</i> function, defined at line 23, calls the <i>dct_1d</i> function and performs the transpose. |
|   |                                                                              | Finally, the <i>dct_1d</i> function, defined at line 4, uses <i>dct_coeff_table</i> and implements a basic iterative form of the 1D Type II DCT algorithm. |
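The row-column structure described above can be sketched in plain C. This is a simplified illustration, not the contents of the provided dct.c: the block size, the element type, and passing the coefficient table as a parameter are assumptions, and any scaling the real design applies is omitted. The buffer names follow those visible in the Directive tab.

```c
#define DCT_SIZE 8        /* assumed block size; the demo sets its own in dct.h */
typedef int dct_data_t;   /* assumed element type, chosen for illustration */

/* 1D transform in basic iterative form:
 * dst[k] = sum over n of coeff[k][n] * src[n].
 * With Type II DCT coefficients in coeff, this is the 1D DCT. */
void dct_1d(dct_data_t coeff[DCT_SIZE][DCT_SIZE],
            const dct_data_t src[DCT_SIZE], dct_data_t dst[DCT_SIZE])
{
    for (int k = 0; k < DCT_SIZE; k++) {
        dct_data_t acc = 0;
        for (int n = 0; n < DCT_SIZE; n++)
            acc += coeff[k][n] * src[n];
        dst[k] = acc;
    }
}

/* 2D transform: 1D pass over each row, transpose, 1D pass over each row of
 * the transposed data (the original columns), then transpose back. */
void dct_2d(dct_data_t coeff[DCT_SIZE][DCT_SIZE],
            dct_data_t in_block[DCT_SIZE][DCT_SIZE],
            dct_data_t out_block[DCT_SIZE][DCT_SIZE])
{
    dct_data_t row_outbuf[DCT_SIZE][DCT_SIZE];
    dct_data_t col_inbuf[DCT_SIZE][DCT_SIZE];
    dct_data_t col_outbuf[DCT_SIZE][DCT_SIZE];

    for (int i = 0; i < DCT_SIZE; i++)          /* row pass */
        dct_1d(coeff, in_block[i], row_outbuf[i]);

    for (int j = 0; j < DCT_SIZE; j++)          /* transpose */
        for (int i = 0; i < DCT_SIZE; i++)
            col_inbuf[j][i] = row_outbuf[i][j];

    for (int i = 0; i < DCT_SIZE; i++)          /* column pass */
        dct_1d(coeff, col_inbuf[i], col_outbuf[i]);

    for (int j = 0; j < DCT_SIZE; j++)          /* transpose back */
        for (int i = 0; i < DCT_SIZE; i++)
            out_block[i][j] = col_outbuf[j][i];
}
```

Note that every call to dct_1d reads all DCT_SIZE elements of its src argument for each output value. That access pattern is what matters later when the loops are pipelined.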

| <b>Action with Description</b>                                          | Point of Emphasis and Key Takeaway                                                                                                                                                                                                                    |
|-------------------------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| <ul><li>Run C synthesis.</li><li>Review the Synthesis report.</li></ul> | Once synthesis completes, the Synthesis report will open in the main viewing area.                                                                                                                                                                    |
|                                                                         | Notice that the estimated clock period is within the requested clock period.                                                                                                                                                                          |
|                                                                         | The Synthesis report also contains latency and throughput information for the design. The results correspond to the default solution (without any directives). You can reduce these numbers further by specifying your requirements via directives. |
|                                                                         | The Synthesis log is available in the Console pane.                                                                                                                                                                                                   |

What is the worst-case latency of the design?

Answer: 6647

|   | Action with Description                               | Point of Emphasis and Key Takeaway |
|---|-------------------------------------------------------|------------------------------------|
| • | Create a new solution named solution2.                | Creating a new solution allows different optimizations to be compared easily. |
| • | Accept the default settings and click <b>Finish</b>.  | There is no need to copy directives from the previous solution because the previous solution does not have any directives. Even if the directives are copied (the default setting), there is no impact on the demo because the initial solution contains no directives. |

As part of the optimization process, as seen in the "Pipeline for Performance" demo, you will begin by pipelining the loops in the design.

|   | Action with Description                              | Point of Emphasis and Key Takeaway                                                                                                                                  |
|---|------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| • | Apply the PIPELINE directive to <i>DCT_Outer_Loop</i>, the outer loop of the <i>dct_1d</i> function. | Moving the PIPELINE directive from the inner loop to the outer loop of <i>dct_1d</i> allows more of the multiply and add operations to run in parallel. |
|   |                                                      | That is, eight multiply and add operations are performed concurrently, which minimizes the number of cycles required to compute each value in the output array. |
|   |                                                      | Leave the II field blank. The tool then targets an II of 1; that is, it tries to schedule the loop so that it accepts a new input every clock cycle. |
|   |                                                      | The directive is written to the directive.tcl file under the dct_prj > solution2 > constraints folder: |
|   |                                                      | <code>set_directive_pipeline "dct_1d/DCT_Outer_Loop"</code> |
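The same directive can also be written in the source as a pragma. The following is a minimal sketch of where it would sit, assuming a simplified loop body; the function name dct_1d_pipelined, the element type, the block size, and the coefficient-table parameter are illustrative assumptions, not the contents of dct.c. A standard C compiler ignores the pragma, so the function still runs as ordinary C.

```c
#define DCT_SIZE 8        /* assumed block size */
typedef int dct_data_t;   /* assumed element type */

void dct_1d_pipelined(dct_data_t coeff[DCT_SIZE][DCT_SIZE],
                      const dct_data_t src[DCT_SIZE],
                      dct_data_t dst[DCT_SIZE])
{
DCT_Outer_Loop:
    for (int k = 0; k < DCT_SIZE; k++) {
/* No II option is given, so the tool targets II = 1. */
#pragma HLS PIPELINE
        dct_data_t acc = 0;
DCT_Inner_Loop:
        /* Pipelining the outer loop makes the tool unroll this inner loop,
         * so all eight multiply-adds (and all eight reads of src) belong
         * to a single pipelined iteration. */
        for (int n = 0; n < DCT_SIZE; n++)
            acc += coeff[k][n] * src[n];
        dst[k] = acc;
    }
}
```

Because each pipelined iteration now needs every element of src, the number of ports on the memory holding src becomes the limiting factor.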



|   | Action with Description | Point of Emphasis and Key Takeaway |
|---|-------------------------|------------------------------------|
| • | Similarly, apply the <b>PIPELINE</b> directive to the following loops:<ul><li><i>Xpose_Row_Inner_Loop</i> of the <i>dct_2d</i> function</li><li><i>Xpose_Col_Inner_Loop</i> of the <i>dct_2d</i> function</li><li><i>RD_Loop_Col</i> of the <i>read_data</i> function</li><li><i>WR_Loop_Col</i> of the <i>write_data</i> function</li></ul> | The Directive tab should look like the figure below after you finish applying the PIPELINE directive. |
| • | Run C synthesis. | Once synthesis completes, the Synthesis report will open in the main viewing area. |
| • | Compare the results of two solutions (solution1 and solution2). | This allows you to compare the different optimizations of the project. You should see the comparison report as shown below. |

Figure: Directive tab with HLS PIPELINE applied to DCT_Outer_Loop, Xpose_Row_Inner_Loop, Xpose_Col_Inner_Loop, RD_Loop_Col, and WR_Loop_Col.

**Performance Estimates: Timing (ns)**

| Clock  |           | solution2 | solution1 |
|--------|-----------|-----------|-----------|
| ap_clk | Target    | 4.00      | 4.00      |
|        | Estimated | 3.48      | 3.48      |

**Performance Estimates: Latency (clock cycles)**

|          |     | solution2 | solution1 |
|----------|-----|-----------|-----------|
| Latency  | min | 946       | 6647      |
|          | max | 946       | 6647      |
| Interval | min | 947       | 6648      |
|          | max | 947       | 6648      |

**Utilization Estimates**

|          | solution2 | solution1 |
|----------|-----------|-----------|
| BRAM_18K | 5         | 5         |
| DSP48E   | 8         | 1         |
| FF       | 849       | 384       |
| LUT      | 556       | 362       |

What is the worst-case latency of the design?

Answer: 946

Go to the Utilization Estimates section and note the number of DSP48E and block RAMs used to implement solution2.

Answer: Number of BRAM_18K: 5; Number of DSP48E: 8

|   | Action with Description                                             | Point of Emphasis and Key Takeaway                                                                                                                                               |
|---|---------------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| • | Select the <b>Console</b> tab and review the synthesis information. | From the Synthesis log, note that the design was not able to achieve the requested II on <i>DCT_Outer_Loop</i> because of the limited memory ports on the <i>src</i> element.    |
|   |                                                                     | The <i>src</i> array is the input to the <i>dct_1d</i> function, and <i>dct_1d</i> is called twice in the <i>dct_2d</i> function (lines 33 and 44). |
|   |                                                                     | At line 33, <i>in_block</i> is accessed via the <i>src</i> element in <i>dct_1d</i>. At line 44, <i>col_inbuf</i> is accessed via the <i>src</i> element in <i>dct_1d</i>. |
|   |                                                                     | Therefore, you will need to partition both the <i>col_inbuf</i> and <i>in_block</i> arrays to achieve an initiation interval (II) of 1. |


```
Vivado HLS Console
INFO: [HLS 200-10] ------
INFO: [HLS 200-10] -- Scheduling module 'dct_dct_1d2'
INFO: [HLS 200-10] ------
INFO: [SCHED 204-61] Pipelining loop 'DCT_Outer_Loop'.
WARNING: [SCHED 204-69] Unable to schedule 'load' operation ('src_load_5', dct.c:17) on array 'src' due to limited memory ports.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 4, Depth: 11.
INFO: [SCHED 204-11] Finished scheduling.
INFO: [HLS 200-111] Elapsed time: 0.078 seconds; current memory usage: 93.3 MB.
INFO: [HLS 200-10] ------
INFO: [HLS 200-10] -- Exploring micro-architecture for module 'dct_dct_1d2'
INFO: [BIND 205-100] Starting micro-architecture generation ...
INFO: [BIND 205-101] Performing variable lifetime analysis.
INFO: [BIND 205-101] Exploring resource sharing.
INFO: [BIND 205-101] Binding ...
INFO: [BIND 205-100] Finished micro-architecture generation.
INFO: [HLS 200-111] Elapsed time: 0.057 seconds; current memory usage: 93.3 MB.
INFO: [HLS 200-10] ------
INFO: [HLS 200-10] -- Scheduling module 'dct_dct_2d'
INFO: [HLS 200-10] ------
INFO: [SCHED 204-11] Starting scheduling
INFO: [SCHED 204-61] Pipelining loop 'Xpose_Row_Outer_Loop_Xpose_Row_Inner_Loop'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 1, Depth: 4.
INFO: [SCHED 204-61] Pipelining loop 'Xpose_Col_Outer_Loop_Xpose_Col_Inner_Loop'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 1, Depth: 4.
INFO: [SCHED 204-11] Finished scheduling.
INFO: [BIND 205-100] Starting micro-architecture generation ...
INFO: [BIND 205-101] Performing variable lifetime analysis.
INFO: [BIND 205-101] Exploring resource sharing.
INFO: [BIND 205-101] Binding ...
INFO: [BIND 205-100] Finished micro-architecture generation.
INFO: [HLS 200-111] Elapsed time: 0.053 seconds; current memory usage: 94.5 MB.
INFO: [HLS 200-10] ------
INFO: [SCHED 204-11] Starting scheduling
INFO: [SCHED 204-61] Pipelining loop 'RD_Loop_Row_RD_Loop_Col'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 1, Depth: 4.
INFO: [SCHED 204-61] Pipelining loop 'WR_Loop_Row_WR_Loop_Col'.
INFO: [SCHED 204-61] Pipelining result: Target II: 1, Final II: 1, Depth: 4.
INFO: [SCHED 204-11] Finished scheduling.
```
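The Final II of 4 reported for DCT_Outer_Loop is consistent with a simple port-count estimate. The sketch below assumes that src is implemented as a dual-port block RAM (at most two reads per clock cycle) and that each pipelined iteration reads all eight elements of src; the helper name min_ii_from_ports is hypothetical and is not part of the tool or the design.

```c
/* Lower bound on the initiation interval imposed by memory ports alone:
 * ceil(reads_per_iteration / ports), computed with integer arithmetic. */
unsigned min_ii_from_ports(unsigned reads_per_iteration, unsigned ports)
{
    return (reads_per_iteration + ports - 1) / ports;
}
```

Under these assumptions, min_ii_from_ports(8, 2) evaluates to 4, which matches the log. If all eight elements can be read at once, min_ii_from_ports(8, 8) evaluates to 1, which is the target II.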

You will now solve the *dct\_1d* pipeline II problem by increasing the memory bandwidth available to it.

This will be done by partitioning the arrays from which the *dct\_1d* inner loops read data (*in\_block* in *Row\_DCT\_Loop* and *col\_inbuf* in *Col\_DCT\_Loop*).

- Create a new solution named solution3.
- Accept the default settings and click Finish.

In this solution, you will apply the ARRAY\_PARTITION directive to buf\_2d\_in of the dct function (the array that dct\_2d receives as in\_block) and to col\_inbuf of the dct\_2d function.

|   | Action with Description | Point of Emphasis and Key Takeaway |
|---|-------------------------|------------------------------------|
| • | Apply the ARRAY_PARTITION directive to <i>buf_2d_in</i> of the <i>dct</i> function. | Partitioning large arrays into multiple smaller arrays or into individual registers can help improve access to data and remove block RAM bottlenecks. |
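As a rough sketch of what array partitioning expresses, the source-level pragma form is shown below on a hypothetical function. The partition type (complete) and dimension (dim=2) are assumptions, because this script does not state the options to use; row_sums, buf, the element type, and the block size are illustrative names and values, not part of dct.c.

```c
#define DCT_SIZE 8          /* assumed block size */
typedef short dct_data_t;   /* assumed element type */

/* Hypothetical example: copy a block into a local buffer, then sum each row.
 * Returns the sum of all elements. */
int row_sums(dct_data_t in[DCT_SIZE][DCT_SIZE], int sums[DCT_SIZE])
{
    dct_data_t buf[DCT_SIZE][DCT_SIZE];
/* Assumed options: split dimension 2 completely, so that each column of buf
 * becomes its own memory and a whole row can be read in one clock cycle. */
#pragma HLS ARRAY_PARTITION variable=buf complete dim=2
    int total = 0;

    for (int r = 0; r < DCT_SIZE; r++)
        for (int c = 0; c < DCT_SIZE; c++)
            buf[r][c] = in[r][c];

Row_Loop:
    for (int r = 0; r < DCT_SIZE; r++) {
#pragma HLS PIPELINE
        int acc = 0;
        for (int c = 0; c < DCT_SIZE; c++)
            acc += buf[r][c];   /* eight reads, one from each partition */
        sums[r] = acc;
        total += acc;
    }
    return total;
}
```

Without the partition pragma, the eight reads of buf[r][c] in one iteration would compete for the ports of a single memory, which is the same situation the Synthesis log reports for src.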



|   | Action with Description | Point of Emphasis and Key Takeaway |
|---|-------------------------|------------------------------------|
| • | Similarly, apply the ARRAY_PARTITION directive to <i>col_inbuf</i> of the <i>dct_2d</i> function. | The Directive tab should look like the figure below after you finish applying the ARRAY_PARTITION directive. |
| • | Run C synthesis. | Once synthesis completes, the Synthesis report will open in the main viewing area. |
| • | Examine the Synthesis log. | |

Has the PIPELINE II directive been met?

Answer: Yes, the pipeline directive met II=1. Array partitioning increased the memory bandwidth, so the design met the requested II value.

|   | Action with Description | Point of Emphasis and Key Takeaway |
|---|-------------------------|------------------------------------|
| • | Compare the results of the two solutions (solution2 and solution3). | You should see the comparison report as shown below. |



Latency was reduced from 946 to 548 clock cycles. Block RAM usage decreased from 5 to 3.

Array partitioning usually increases memory utilization. In this case, however, some of the memory elements were implemented in distributed RAMs/ROMs (you can observe this in the Synthesis log in the Console tab), so the number of block RAMs was actually lower than in the previous solution.

## **Summary**

Memory elements may become bottlenecks when it comes to meeting throughput and latency requirements. In this demo, you learned how to apply a partition directive on arrays in the design and observed its impact on resources and throughput.

#### References:

- Supporting materials
  - Vivado Design Suite Tutorial: High-Level Synthesis (UG871)
  - Vivado Design Suite User Guide: High-Level Synthesis (UG902)